Fix UI stuck on “Disconnected” during network-change engine restart#167
Fix UI stuck on “Disconnected” during network-change engine restart#167
Conversation
When EngineRestarter stopped and restarted the Go engine after a network type change, the UI only saw the engine's onDisconnected callback and had no visibility into the reconnect attempt. If the restart stalled (e.g. on a stale management RPC), the UI stayed on Disconnected for the full stall window, making it look like the client never reconnected. Emit onConnecting() from EngineRestarter at stop and at re-launch to keep the UI in the Connecting state throughout the restart, and emit onDisconnected() on error or the 30s safety timeout so a truly failed restart doesn't leave the UI stuck on Connecting.
Pin the process's outgoing sockets to the current default Android Network via ConnectivityManager.bindProcessToNetwork so fresh dials after a WiFi/cellular switch do not stall on TCP SYN retransmits through the departing interface. Skip the initial onAvailable burst fired right after registering the NetworkCallback. That burst reflects current state, not a transition, and was triggering a spurious EngineRestarter restart that cancelled the in-flight login on cold start.
|
Important Review skippedDraft detected. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Replace the time-based grace window with an isEngineRunning predicate. The initial onAvailable burst that Android fires right after registerNetworkCallback cannot trigger an EngineRestarter run because the engine is not up yet at that point. Tests updated accordingly; adds coverage for the engine-not-running path.
Use IMPORTANCE_LOW and explicitly clear sound/vibration on the channel so the persistent VPN notification does not play a sound or vibrate on creation or each connection state update.
There was a problem hiding this comment.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
tool/src/main/java/io/netbird/client/tool/EngineRestarter.java (1)
52-57:⚠️ Potential issue | 🟠 MajorRemove the restart listener when the timeout fires.
After the 30s timeout,
isRestartInProgressis reset andonDisconnected()is emitted, but the anonymousServiceStateListenerremains registered. If the engine stops later, that stale listener can still callrunWithoutAuth()and restart after the timeout path declared failure.🔧 Proposed fix
timeoutCallback = () -> { if (isRestartInProgress) { Log.e(LOGTAG, "engine restart timeout - forcing flag reset"); isRestartInProgress = false; + if (currentListener != null) { + engineRunner.removeServiceStateListener(currentListener); + currentListener = null; + } notifyDisconnected(); } }; @@ public void onStarted() { Log.d(LOGTAG, "engine restarted successfully"); isRestartInProgress = false; // Reset flag on success handler.removeCallbacks(timeoutCallback); // Cancel timeout engineRunner.removeServiceStateListener(this); + currentListener = null; } @@ public void onError(String msg) { Log.e(LOGTAG, "restart failed: " + msg); isRestartInProgress = false; // Resetting flag on error as well handler.removeCallbacks(timeoutCallback); // Cancel timeout engineRunner.removeServiceStateListener(this); + currentListener = null; notifyDisconnected(); }Also applies to: 65-90, 142-155
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tool/src/main/java/io/netbird/client/tool/EngineRestarter.java` around lines 52 - 57, The timeoutCallback currently resets isRestartInProgress and calls notifyDisconnected() but leaves the anonymous ServiceStateListener registered; update timeoutCallback to also unregister/remove that listener (the same instance registered earlier) when the timeout fires so the stale ServiceStateListener can't later call runWithoutAuth() and trigger another restart; apply the same fix to the other restart-timeout handlers in the file (the blocks around the ServiceStateListener registration in the 65-90 and 142-155 sections) ensuring you keep a reference to the ServiceStateListener instance so you can call the appropriate remove/unregister method when cancelling on success or timeout.tool/src/main/java/io/netbird/client/tool/ForegroundNotification.java (1)
25-31:⚠️ Potential issue | 🟠 MajorUse a new channel ID for silent/low-importance notification settings to affect existing installs.
When
createNotificationChannel()is called with an existing channel ID, Android ignores updates to sound, vibration, and lights properties—these can only be set on initial channel creation. Importance can only be lowered if the user hasn't modified channel settings. Existing users will retain their original audible behavior unless you use a different channel ID.Suggested fix
- String channelId = service.getPackageName(); + String channelId = service.getPackageName() + ".foreground.silent";🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tool/src/main/java/io/netbird/client/tool/ForegroundNotification.java` around lines 25 - 31, ForegroundNotification currently re-uses service.getPackageName() as the NotificationChannel ID which prevents changing sound/vibration for existing installs; change to a new, distinct channel ID (e.g., a constant like FOREGROUND_CHANNEL_ID_SILENT or service.getPackageName() + ".fg_silent") used when creating the NotificationChannel in the same creation block (the NotificationChannel constructor and channel.setSound/enableVibration calls) and update any places that build/post the foreground Notification to use that new channel ID so the silent/low-importance settings apply to all users.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@netbird`:
- Line 1: The submodule pointer references commit
5b09078da2ac14550741ce8731e7cf4b4a62a728 which is not reachable; verify whether
that commit exists in the official netbirdio/netbird.git or a private fork, then
update the submodule to a valid commit: check the branch containing the intended
change, fetch the correct commit hash (or switch the submodule URL if it should
point to a different repo), and update the submodule reference (e.g., via git
submodule update --init --remote or by committing the corrected SHA in the
superproject and updating .gitmodules if the URL must change) so cloning with
--recurse-submodules succeeds.
In
`@tool/src/main/java/io/netbird/client/tool/networks/NetworkChangeDetector.java`:
- Around line 63-85: The onAvailable/onLost handlers in
initDefaultNetworkCallback must be guarded by an atomic/current bound-network
state and a callback-active flag so late onLost or onAvailable after unregister
don't undo a newer binding; add a field (e.g., defaultNetworkCallbackActive) set
to true before registering the defaultNetworkCallback and set to false when
unregisterNetworkCallback begins, track the currentlyBoundDefaultNetwork (or
similar) when bindProcessToNetwork(network) succeeds, and in onLost only clear
the binding if the lost network equals the currentlyBoundDefaultNetwork and
defaultNetworkCallbackActive is true; likewise ignore onAvailable if
defaultNetworkCallbackActive is false to avoid rebinding after shutdown.
---
Outside diff comments:
In `@tool/src/main/java/io/netbird/client/tool/EngineRestarter.java`:
- Around line 52-57: The timeoutCallback currently resets isRestartInProgress
and calls notifyDisconnected() but leaves the anonymous ServiceStateListener
registered; update timeoutCallback to also unregister/remove that listener (the
same instance registered earlier) when the timeout fires so the stale
ServiceStateListener can't later call runWithoutAuth() and trigger another
restart; apply the same fix to the other restart-timeout handlers in the file
(the blocks around the ServiceStateListener registration in the 65-90 and
142-155 sections) ensuring you keep a reference to the ServiceStateListener
instance so you can call the appropriate remove/unregister method when
cancelling on success or timeout.
In `@tool/src/main/java/io/netbird/client/tool/ForegroundNotification.java`:
- Around line 25-31: ForegroundNotification currently re-uses
service.getPackageName() as the NotificationChannel ID which prevents changing
sound/vibration for existing installs; change to a new, distinct channel ID
(e.g., a constant like FOREGROUND_CHANNEL_ID_SILENT or service.getPackageName()
+ ".fg_silent") used when creating the NotificationChannel in the same creation
block (the NotificationChannel constructor and channel.setSound/enableVibration
calls) and update any places that build/post the foreground Notification to use
that new channel ID so the silent/low-importance settings apply to all users.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: b5d47af5-9d72-4650-b86d-dbd761803e47
📒 Files selected for processing (8)
netbirdtool/src/main/java/io/netbird/client/tool/EngineRestarter.javatool/src/main/java/io/netbird/client/tool/EngineRunner.javatool/src/main/java/io/netbird/client/tool/ForegroundNotification.javatool/src/main/java/io/netbird/client/tool/VPNService.javatool/src/main/java/io/netbird/client/tool/networks/ConcreteNetworkAvailabilityListener.javatool/src/main/java/io/netbird/client/tool/networks/NetworkChangeDetector.javatool/src/test/java/io/netbird/client/tool/ConcreteNetworkAvailabilityListenerUnitTest.java
| @@ -1 +1 @@ | |||
| Subproject commit f01c1eea6ac803fff4678abe93f95b4c965e372b | |||
| Subproject commit 5b09078da2ac14550741ce8731e7cf4b4a62a728 | |||
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Description: Verify the submodule commit exists and inspect its details
# Check if submodule directory exists and fetch commit details
if [ -d "netbird" ]; then
cd netbird
# Verify commit exists
git cat-file -t 5b09078da2ac14550741ce8731e7cf4b4a62a728 2>/dev/null || echo "ERROR: Commit not found"
# Show commit details
git log -1 --oneline 5b09078da2ac14550741ce8731e7cf4b4a62a728 2>/dev/null
# Check which branches contain this commit
git branch -r --contains 5b09078da2ac14550741ce8731e7cf4b4a62a728 2>/dev/null
else
echo "Submodule directory 'netbird' not found"
fiRepository: netbirdio/android-client
Length of output: 93
🏁 Script executed:
# Check if .gitmodules exists and shows netbird submodule configuration
if [ -f ".gitmodules" ]; then
cat .gitmodules
else
echo ".gitmodules not found"
fiRepository: netbirdio/android-client
Length of output: 153
🏁 Script executed:
# Check the current netbird submodule pointer in the parent repo
git config --file .gitmodules --get-regexp netbird 2>/dev/null || echo "No netbird submodule found in .gitmodules"Repository: netbirdio/android-client
Length of output: 161
🏁 Script executed:
# Verify the commit hash format is valid (40 hex characters)
commit_hash="5b09078da2ac14550741ce8731e7cf4b4a62a728"
if [[ $commit_hash =~ ^[0-9a-f]{40}$ ]]; then
echo "Valid 40-character hex commit hash format"
else
echo "Invalid commit hash format"
fiRepository: netbirdio/android-client
Length of output: 109
🌐 Web query:
NetBird repository commit 5b09078da2ac14550741ce8731e7cf4b4a62a728
💡 Result:
No information found on commit 5b09078da2ac14550741ce8731e7cf4b4a62a728 in the NetBird repository (https://github.com/netbirdio/netbird). Extensive searches across GitHub commits, pull requests, and related pages did not surface this specific SHA. It may be from a private fork, deleted branch, or different repository. The main NetBird repo has recent activity up to April 2026 with commits like 5b85edb (management: omit proxy_protocol), fd9d430 (client: simplify entrypoint), but no match for the queried hash.
Citations:
- 1: https://github.com/netbirdio/netbird/commits/5b85edb7
- 2: https://github.com/netbirdio/netbird/commits/69d4b5d8
- 3: https://github.com/netbirdio/netbird
- 4: https://github.com/netbirdio/netbird/commits/main/
- 5: https://github.com/netbirdio/netbird/tree/refs/heads/main
- 6: netbirdio/netbird@fd9d430
- 7: netbirdio/netbird@b3bbc0e
- 8: netbirdio/netbird@e5914e4
Submodule commit is not reachable in the official NetBird repository. Verify and correct the commit hash.
The submodule is pointing to commit 5b09078da2ac14550741ce8731e7cf4b4a62a728, but this commit does not exist in the official NetBird repository (https://github.com/netbirdio/netbird.git). This will cause cloning to fail with --recurse-submodules.
Verify that:
- The commit exists and is accessible (may be in a private fork or needs a different URL)
- If this is from a test branch, confirm the correct commit hash from that branch
- Update the submodule pointer to a valid, reachable commit before merging
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@netbird` at line 1, The submodule pointer references commit
5b09078da2ac14550741ce8731e7cf4b4a62a728 which is not reachable; verify whether
that commit exists in the official netbirdio/netbird.git or a private fork, then
update the submodule to a valid commit: check the branch containing the intended
change, fetch the correct commit hash (or switch the submodule URL if it should
point to a different repo), and update the submodule reference (e.g., via git
submodule update --init --remote or by committing the corrected SHA in the
superproject and updating .gitmodules if the URL must change) so cloning with
--recurse-submodules succeeds.
| private void initDefaultNetworkCallback() { | ||
| defaultNetworkCallback = new ConnectivityManager.NetworkCallback() { | ||
| @Override | ||
| public void onAvailable(@NonNull Network network) { | ||
| Log.d(LOGTAG, "default network became " + network + ", binding process to it"); | ||
| try { | ||
| if (!connectivityManager.bindProcessToNetwork(network)) { | ||
| Log.w(LOGTAG, "bindProcessToNetwork returned false for " + network); | ||
| } | ||
| } catch (Exception e) { | ||
| Log.e(LOGTAG, "bindProcessToNetwork failed", e); | ||
| } | ||
| } | ||
|
|
||
| @Override | ||
| public void onLost(@NonNull Network network) { | ||
| Log.d(LOGTAG, "default network " + network + " lost, clearing process binding"); | ||
| try { | ||
| connectivityManager.bindProcessToNetwork(null); | ||
| } catch (Exception e) { | ||
| Log.e(LOGTAG, "bindProcessToNetwork(null) failed", e); | ||
| } | ||
| } |
There was a problem hiding this comment.
Guard process binding with active/current-network state.
onLost() always clears the process binding, so an old default network’s late onLost() can erase a newer binding from onAvailable(). Similarly, an in-flight onAvailable() during unregisterNetworkCallback() can re-bind the process after shutdown. Track the currently bound default network and ignore callbacks once unregister starts.
🔧 Proposed direction
+ private final Object defaultNetworkBindingLock = new Object();
+ private boolean defaultNetworkCallbackActive = false;
+ private Network boundDefaultNetwork;
private void initDefaultNetworkCallback() {
defaultNetworkCallback = new ConnectivityManager.NetworkCallback() {
`@Override`
public void onAvailable(`@NonNull` Network network) {
+ synchronized (defaultNetworkBindingLock) {
+ if (!defaultNetworkCallbackActive) {
+ return;
+ }
+ boundDefaultNetwork = network;
+ }
Log.d(LOGTAG, "default network became " + network + ", binding process to it");
try {
if (!connectivityManager.bindProcessToNetwork(network)) {
Log.w(LOGTAG, "bindProcessToNetwork returned false for " + network);
@@
`@Override`
public void onLost(`@NonNull` Network network) {
+ synchronized (defaultNetworkBindingLock) {
+ if (!defaultNetworkCallbackActive || !network.equals(boundDefaultNetwork)) {
+ return;
+ }
+ boundDefaultNetwork = null;
+ }
Log.d(LOGTAG, "default network " + network + " lost, clearing process binding");
try {
connectivityManager.bindProcessToNetwork(null);
} catch (Exception e) {
@@
public void unregisterNetworkCallback() {
+ synchronized (defaultNetworkBindingLock) {
+ defaultNetworkCallbackActive = false;
+ boundDefaultNetwork = null;
+ }
try {
connectivityManager.unregisterNetworkCallback(networkCallback);Also set defaultNetworkCallbackActive = true before registering the default callback, and reset it if registration fails.
Also applies to: 96-111
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In
`@tool/src/main/java/io/netbird/client/tool/networks/NetworkChangeDetector.java`
around lines 63 - 85, The onAvailable/onLost handlers in
initDefaultNetworkCallback must be guarded by an atomic/current bound-network
state and a callback-active flag so late onLost or onAvailable after unregister
don't undo a newer binding; add a field (e.g., defaultNetworkCallbackActive) set
to true before registering the defaultNetworkCallback and set to false when
unregisterNetworkCallback begins, track the currentlyBoundDefaultNetwork (or
similar) when bindProcessToNetwork(network) succeeds, and in onLost only clear
the binding if the lost network equals the currentlyBoundDefaultNetwork and
defaultNetworkCallbackActive is true; likewise ignore onAvailable if
defaultNetworkCallbackActive is false to avoid rebinding after shutdown.
Title
Fix UI stuck on “Disconnected” during network-change engine restart
Summary
Fixes two issues that caused the NetBird Android app to appear “Disconnected” for up to ~27 seconds during WiFi ↔ cellular transitions, even though the engine eventually recovered.
Problem
Diagnosed from user-reported logs of a WiFi → cellular → WiFi roam:
UI stuck on “Disconnected” during restart
EngineRestarterstops and restarts the Go engine on network-type changes, but neither the restarter nor the engine reported a “Connecting” state while the restart was in progress.As a result, the UI only received
onDisconnectedand waited foronConnected, which could take 20+ seconds if the new connection stalled.Socket dials stalled (~12 seconds)
New sockets (Signal, Management) created immediately after a WiFi switch experienced TCP SYN retransmissions via the outgoing interface.
This happened because sockets were not pinned to the current default network.
Calling
VpnService.protect(fd)prevents VPN routing loops but does not bind the socket to a specific underlying network.Spurious restart on cold start
ConcreteNetworkAvailabilityListenertreated Android’s initialonAvailableburst (triggered right afterregisterNetworkCallback) as a network transition.This caused
EngineRestarterto cancel the first login withcontext canceled.Changes
Emit synthetic connection states during restart
(EngineRestarter.java, EngineRunner.java)
ConnectionListener.onConnecting()whenstop()is invoked and again when restart beginsonDisconnected()on restart failure or after a 30-second timeout to prevent indefinite “Connecting”Bind process to current default network
(NetworkChangeDetector.java)
registerDefaultNetworkCallbackConnectivityManager.bindProcessToNetwork(network)on each default network changeonLostand duringunregisterNetworkCallbackIgnore initial
onAvailableburst(ConcreteNetworkAvailabilityListener.java)
subscribe()Update dependency
Summary by CodeRabbit
Release Notes
Bug Fixes
Refactor
Tests